monitor lifecycle conductor #2723

Open
benzekrimaha wants to merge 23 commits into development/9.3 from improvement/BB-740-monitor-lifecycle-conductor

@benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32
@codecov

codecov Bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 97.47899% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.54%. Comparing base (145f7a6) to head (de1170d).
⚠️ Report is 46 commits behind head on development/9.3.

Files with missing lines Patch % Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 80.00% 3 Missing ⚠️
Additional details and impacted files

Files with missing lines Coverage Δ
extensions/lifecycle/LifecycleConfigValidator.js 100.00% <100.00%> (ø)
extensions/lifecycle/LifecycleMetrics.js 100.00% <100.00%> (+1.78%) ⬆️
...tensions/lifecycle/conductor/LifecycleConductor.js 84.40% <100.00%> (+0.69%) ⬆️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.90% <100.00%> (+0.14%) ⬆️
extensions/lifecycle/tasks/LifecycleTask.js 91.63% <100.00%> (+0.08%) ⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.99% <100.00%> (+0.10%) ⬆️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 81.33% <100.00%> (+0.25%) ⬆️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 92.15% <100.00%> (+0.23%) ⬆️
lib/models/ActionQueueEntry.js 96.29% <ø> (ø)
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 80.83% <80.00%> (+0.96%) ⬆️

... and 6 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 80.57% <ø> (-0.13%) ⬇️
Ingestion 70.53% <ø> (-0.62%) ⬇️
Lifecycle 79.27% <97.47%> (+0.65%) ⬆️
Oplog Populator 85.83% <ø> (ø)
Replication 59.61% <ø> (-0.04%) ⬇️
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
+ Coverage            74.48%   74.54%   +0.05%     
===================================================
  Files                  200      200              
  Lines                13603    13690      +87     
===================================================
+ Hits                 10132    10205      +73     
- Misses                3461     3475      +14     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.09% <0.84%> (-0.06%) ⬇️
api:routes 8.91% <0.84%> (-0.06%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.13% <10.08%> (-0.89%) ⬇️
ingestion 12.42% <0.84%> (-0.13%) ⬇️
lifecycle 19.04% <67.22%> (+0.19%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.45% <10.08%> (-0.04%) ⬇️
unit 51.45% <90.75%> (+0.43%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title from "Improvement/bb 740 monitor lifecycle conductor" to "Improvement/BB-740 monitor lifecycle conductor" Mar 18, 2026
@francoisferrand francoisferrand changed the title from "Improvement/BB-740 monitor lifecycle conductor" to "monitor lifecycle conductor" Mar 18, 2026
const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
this._scanId = uuid();

Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleTask.spec.js
Comment thread extensions/lifecycle/tasks/LifecycleTask.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js
@claude

claude Bot commented Jun 1, 2026

  • tests/unit/lifecycle/LifecycleTask.spec.js:719 — Bug: test asserts contextInfo.requestId is set and reqId is undefined, but _getBucketEntryContext produces reqId. Test will fail. Either fix the test assertions or override _getBucketEntryContext in LifecycleTaskV2 to preserve the old requestId key.
  • extensions/lifecycle/tasks/LifecycleTask.js:140 — Behavior change: V2 continuation entries now use reqId instead of requestId in contextInfo, changing the Kafka message format for V2 continuation entries during rolling upgrades. Low impact (log correlation only) but worth confirming intentional.
  • extensions/lifecycle/conductor/LifecycleConductor.js:530 — scanId parameter added to listBuckets, listBucketdBuckets, and listMongodbBuckets but unused in all three — already available via this._currentScanId.

Review by Claude Code

Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
@claude

claude Bot commented Jun 1, 2026

  • tests/unit/lifecycle/LifecycleConductor.spec.js:325 — listBuckets stub uses 4 params (mQueue, scanId, log, cb) but method signature is (queue, log, cb) (3 params). cb is always undefined, causing TypeError. This breaks an existing test that was working before.
    - tests/unit/lifecycle/LifecycleConductor.spec.js:240 — Same mismatch in new test 'should generate a conductorScanId'. The UUID assertion on scanId also fails because it receives the logger object, not a string. Use conductor._currentScanId instead.
    - tests/unit/lifecycle/LifecycleConductor.spec.js:251 — Same mismatch in new test 'should not close scan metrics when throttling occurs after scan start'.

    All three are the same root cause: the test author assumed scanId would be passed as a parameter to listBuckets, but the method signature was never updated. The scan ID is available via conductor._currentScanId (set before listBuckets is called).

    Review by Claude Code

@claude

claude Bot commented Jun 1, 2026

LGTM — solid implementation of conductor scan tracking.

Reviewed: new Prometheus metrics (s3_lifecycle_latest_batch_end_time, s3_lifecycle_latest_batch_bucket_count, s3_lifecycle_bucket_processor_scan_messages_processed_total, s3_lifecycle_bucket_processor_scan_message_age_seconds), LifecycleStalledScan and LifecycleBucketProcessorMultipleParallelScans alerts, scan context propagation through conductor → bucket processor → lifecycle tasks → action entries, Grafana dashboard panels, and comprehensive test coverage.

Key design choices verified:
- conductor_scan_id label cardinality is managed by cleanup timers in prom-client; Prometheus TSDB retention is documented as an accepted tradeoff
- Rolling upgrade safety: undefined conductorScanId from old conductors is handled gracefully (try-catch in metrics, early-return guards on timestamp validation)
- Throttling-after-scan-start intentionally leaves scan state intact so LifecycleStalledScan alert can fire
- V1/V2 continuation entries correctly preserve scan context via _makeContinuationEntry while maintaining the reqId vs requestId distinction

One minor observation: BackbeatTestConsumer._expectedConductorScanId and _expectedConductorScanStartTimestamp (lines 94-95) are assigned but never read — they appear to be dead code.

Review by Claude Code

Comment thread tests/utils/BackbeatTestConsumer.js Outdated
Comment thread extensions/lifecycle/LifecycleMetrics.js
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js
@claude

claude Bot commented Jun 1, 2026

  • extensions/lifecycle/LifecycleMetrics.js:345 — During rolling upgrades, conductorScanId is undefined for messages from old conductors, creating a conductor_scan_id="undefined" Prometheus label that could cause a spurious LifecycleBucketProcessorMultipleParallelScans alert during the upgrade window. Consider guarding with a falsy check or using a sentinel value.
  • extensions/lifecycle/conductor/LifecycleConductor.js:478 — Pre-existing: when listBuckets returns Throttling after scan start, _batchInProgress stays true and the new _currentScanId/_currentScanStartTimestamp are not cleared. This PR's new test codifies this behavior, but it can leave the conductor stuck until restart. Consider cleaning up scan state on all error paths.

Review by Claude Code

Comment thread extensions/lifecycle/LifecycleMetrics.js
Comment thread extensions/lifecycle/LifecycleMetrics.js
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js
Comment thread monitoring/lifecycle/alerts.yaml
Comment thread extensions/lifecycle/tasks/LifecycleTask.js
@claude

claude Bot commented Jun 1, 2026

PR #2723: monitor lifecycle conductor — Review Summary

This PR adds comprehensive lifecycle conductor scan monitoring: a scan id (UUID v4) propagated through the Kafka message pipeline, new Prometheus metrics (scan end time, bucket count, per-scan message counter, message age histogram), Grafana dashboard panels, and two new alerts (LifecycleStalledScan, LifecycleBucketProcessorMultipleParallelScans). The refactoring of context propagation via _makeContinuationEntry, _getActionContext, and _getScanContext is clean and reduces duplication across LifecycleTask/LifecycleTaskV2. Tests are thorough.

Findings:

  • monitoring/lifecycle/alerts.yaml:106 — LifecycleStalledScan uses or vector(0) for the end-time metric, which will trigger a false positive on fresh deployments where no scan has ever completed. The first scan's start time will always be greater than 0, satisfying the condition immediately.
  • extensions/lifecycle/LifecycleMetrics.js:109 — The conductor_scan_id label on the per-scan counter creates unbounded time series (one per UUID). The cleanup timer mitigates this, but resets on every message, meaning active scans never get cleaned up. A hard cap on concurrent tracked scan ids would provide a safety valve.
  • extensions/lifecycle/LifecycleMetrics.js:165 — configureLifecycleScanMetricRetention does not reschedule existing timers; only future timers use the new value. Minor issue since config is loaded at startup.
  • extensions/lifecycle/conductor/LifecycleConductor.js:476 — elapsedMs measures from a start variable that gets reassigned mid-waterfall, so error paths that short-circuit before the reassignment measure a slightly different time span than expected.

No security issues, no breaking changes to Kafka message formats (new fields are additive and handled gracefully during rolling upgrades).

Review by Claude Code

@claude

claude Bot commented Jun 1, 2026

LGTM — well-structured PR. The scan-context propagation through conductor → bucket-task messages → action entries is thorough and consistent. Metric cleanup timers correctly prevent unbounded prom-client memory growth. Alert PromQL for LifecycleStalledScan handles the edge cases (first scan, scan completion, vector(0) fallback). V1/V2 continuation-entry context is correctly preserved via _makeContinuationEntry with the LifecycleTaskV2 override of _getBucketEntryRequestIdContext. Test coverage is solid across metrics, conductor lifecycle, scan-context propagation, and the BackbeatTestConsumer sentinel handling.

Review by Claude Code

nBucketsQueued,
});
if (scanStarted) {
this._completeCurrentScan(log, totalBucketsListed);

On the error-after-start path we still call _completeCurrentScan, which sets latest_batch_end_time and latest_batch_bucket_count. Two consequences:

  1. A failed scan then looks "completed" to LifecycleStalledScan (the alert sees end ≥ start), so a scan that started, errored, and wedged won't trip the stalled-scan alert it's meant to catch.
  2. bucket_count is published with a misleading partial value.

Resetting _currentScanId/_currentScanStartTimestamp on failure is correct, but should we record the end metrics on the error path at all? Suggest resetting scan state without setting end-time/bucket-count when the scan failed.

conductorScanId: scanId,
conductorScanStartTimestamp: start,
});
LifecycleMetrics.onProcessBuckets(log, start);

This changes the meaning of an existing metric. s3_lifecycle_latest_batch_start_time used to be set at scan completion (previously onProcessBuckets was called in the success callback); it's now set at scan start.

That's a behavior change for anything already reading this metric — in particular LifecycleLateScan. After this change LateScan means "the conductor hasn't started a scan recently" and no longer catches "a scan started but hung"; that coverage moves entirely to the new LifecycleStalledScan. So the two new StalledScan rules aren't purely additive — they backfill coverage this flip removed from LateScan.

Can we confirm this is intended and call it out explicitly in the PR description / changelog so downstream dashboards and alerts are aware?

});

const bucketProcessorScanMessageAgeSeconds = ZenkoMetrics.createHistogram({
name: 's3_lifecycle_bucket_processor_scan_message_age_seconds',

The help text says the age is measured "when they finish processing in the bucket processor," but onBucketProcessorScanMessageReceived is called at message pickup (in _processBucketEntry, before fetching the bucket lifecycle config or scheduling the task). So this histogram actually measures "elapsed wall-time since the scan started, sampled at dequeue" — a backlog/lag signal, not processing time. Continuation slices also inherit the original scan-start timestamp, so age keeps growing across a long scan regardless of when a given slice was enqueued.

Either fix the help text to describe what's measured, or move the observation to actual task completion if processing time is what we want.

});

const bucketProcessorScanMessagesProcessed = ZenkoMetrics.createCounter({
name: 's3_lifecycle_bucket_processor_scan_messages_processed_total',

Naming nit: this counter is incremented at message receipt, before processing and regardless of success or object count (the JSDoc on onBucketProcessorScanMessageReceived says as much), yet it's named ..._scan_messages_processed_total and the method says ...Received. "processed" overstates what it counts. Suggest ..._scan_messages_received_total to match the semantics and the method name.

assert(parsedMsg.contextInfo?.reqId, 'expected contextInfo.reqId field');
parsedMsg.contextInfo.reqId = expectedMsg.value.contextInfo?.reqId;
expectedValue.contextInfo.reqId = parsedMsg.contextInfo.reqId;
if (expectedValue.contextInfo?.conductorScanId === 'test-scan-id') {

Building on François's earlier point about this utility: this doesn't actually test scan-id propagation. It overwrites the expected conductorScanId/conductorScanStartTimestamp with whatever the actual message contained, and never asserts the fields are present — so a message that omitted them entirely would still pass deepStrictEqual. Unlike the reqId case just above, there's no assert(parsedMsg.contextInfo?.conductorScanId, ...) guarding presence.

Either add a presence assertion (mirroring the reqId check) or move the scan-id assertion back into the test rather than the shared consumer.

processActionEntry(entry, done) {
const startTime = Date.now();
const log = this.logger.newRequestLogger();
const conductorScanId = entry.getContextAttribute('conductorScanId');

Consistency: there are now three different ways scan context gets onto the logger across the action tasks. LifecycleUpdateExpirationTask does log.addDefaultFields(entry.getLogInfo()), while here and in LifecycleUpdateTransitionTask we manually getContextAttribute('conductorScanId') + getContextAttribute('conductorScanStartTimestamp') + addDefaultFields.

Since ActionQueueEntry._loggedAttributes now includes both scan fields, the manual extraction is redundant — these two could use entry.getLogInfo() too (this is the direction François asked for earlier). Standardizing on one approach would be cleaner.

Comment thread conf/config.json
}
},
"concurrency": 10,
"scanMetricRetentionS": 86400,

The retention default now lives in three places: 86400 here, and DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60 duplicated in both LifecycleMetrics.js and LifecycleConfigValidator.js. Suggest a single exported constant (re-used by the validator's .default(...)) so these can't drift apart.

});

describe('_indexesGetOrCreate', () => {
it('should include conductor scan id in task context', () => {

This test exercises _taskToMessage, but it's placed under describe('_indexesGetOrCreate'). Minor, but worth moving it to a block that matches what it covers (or its own describe('_taskToMessage')).

}

const ageSeconds = (Date.now() - conductorScanStartTimestamp) / 1000;
if (ageSeconds >= 0) {

The ageSeconds >= 0 guard drops the observation entirely when the computed age is negative, rather than clamping it to 0. Since the age is a cross-host subtraction (Date.now() here minus the conductor's Date.now() carried in the message), small negative values are expected and dropping them silently removes the fastest samples, biasing the histogram upward. Suggest observe(..., Math.max(0, ageSeconds)) so those samples still land in the lowest bucket.

4 participants